[57] ABSTRACT

A computer-based system and method of retrieving information
pertaining to Web documents on a computer network is
disclosed. The method includes maintaining an address map
that associates primary addresses with secondary addresses.
A primary address includes a network retrieval protocol and
a network address. The secondary address may include a
different retrieval protocol or a different network address
from the primary document address. A Web crawler retrieves
a Web document using the primary document address, and
determines whether the address map contains a secondary
document address prefix corresponding to the primary
document address prefix. If a secondary document address
prefix exists, the Web crawler creates a secondary address,
retrieves additional information pertaining to the Web
document, and combines the additional information with the
data retrieved from the Web document. The combined data may
be stored in an index, and subsequently used to perform a
document search.

28 Claims, 5 Drawing Sheets 
METHOD OF WEB CRAWLING UTILIZING
ADDRESS MAPPING

FIELD OF THE INVENTION

The present invention relates to the field of network
information software and, in particular, to methods and
systems for retrieving data from network sites.

BACKGROUND OF THE INVENTION

In recent years, there has been a tremendous proliferation
of computers connected to a global network known as the
Internet. A "client" computer connected to the Internet can
download digital information from "server" computers
connected to the Internet. Client application software
executing on client computers typically accepts commands
from a user and obtains data and services by sending
requests to server applications running on server computers
connected to the Internet. A number of protocols are used to
exchange commands and data between computers connected to
the Internet. The protocols include the File Transfer
Protocol (FTP), the Hyper Text Transfer Protocol (HTTP), the
Simple Mail Transfer Protocol (SMTP), and the "Gopher"
document protocol.

The HTTP protocol is used to access data on the World Wide
Web, often referred to as "the Web." The World Wide Web is
an information service on the Internet providing documents
and links between documents. The World Wide Web is made up
of numerous Web sites around the world that maintain and
distribute Web documents. A Web site may use one or more Web
server computers that store and distribute documents in one
of a number of formats including the Hyper Text Markup
Language (HTML). An HTML document contains text and metadata
or commands providing formatting information. HTML documents
also include embedded "links" that reference other data or
documents located on any Web server computer. The referenced
documents may represent text, graphics, audio, or video in
respective formats.

A Web browser is a client application that communicates with
server computers via FTP, HTTP, and Gopher protocols. Web
browsers receive Web documents from the network and present
them to a user. Internet Explorer, available from Microsoft
Corporation, of Redmond, Wash., is an example of a popular
Web browser application.

An intranet is a local area network containing Web servers
and client computers operating in a manner similar to the
World Wide Web described above. Typically, all of the
computers on an intranet are contained within a company or
organization.

Web crawlers are computer programs that automatically
retrieve numerous Web documents from one or more Web sites.
A Web crawler processes the received data, preparing the
data to be subsequently processed by other programs. For
example, a Web crawler may use the retrieved data to create
an index of documents available over the Internet or an
intranet. A "search engine" can later use the index to
locate Web documents that satisfy specified criteria.

Web crawlers use the same protocols as other programs, such
as Web browsers and file system explorers, to access Web
documents. The type of data that a Web crawler retrieves is
determined by the protocol used. For example, the HTTP
protocol does not provide a mechanism to obtain an access
control list corresponding to a Web document. In another
example, a Web document may have an associated second Web
document at a different address, the second Web document
containing information pertaining to the first Web document.
HTTP does not provide an easy mechanism for obtaining
related data from multiple sources and combining the data.

It is desirable to have a mechanism by which a Web crawler
can increase the amount of information it obtains for each
Web document. Preferably, such a mechanism will provide a
Web crawler with a way to obtain information pertaining to a
Web document by using more than one protocol. Additionally,
a preferable mechanism will also provide a Web crawler with
a way to obtain information pertaining to a Web document
from a source other than the Web document itself. The
present invention is directed to providing such a mechanism.

SUMMARY OF THE INVENTION

In accordance with this invention, a system and
computer-based method of retrieving data from a computer
network are provided. The method includes performing a Web
crawl, by retrieving a Web document and subsequently
retrieving additional Web documents based on addresses
specified in hyperlinks within each Web document. For each
Web document, an address map is checked to determine whether
the document has a secondary document address corresponding
to the first, or primary, document address. If a secondary
document address exists, the secondary document address is
used to retrieve data pertaining to the Web document.

In accordance with other aspects of this invention, a
document address includes a protocol specification and a
network address specification. The secondary document
address may differ from the primary document address by
having different specified protocols, different network
addresses, or both. The secondary document address allows
the retrieval of data not easily obtained using the first
document address. The additional data may include data
obtainable by using file system commands.

In accordance with still other aspects of this invention,
after retrieving a Web document using a primary document
address and additional data pertaining to the Web document
using a secondary document address, the data obtained using
the secondary document address is stored with the data
obtained from the Web document. The combined data may be
stored in a document index, which is subsequently used to
locate the Web document. In accordance with yet still other
aspects of this invention, an address map is maintained. The
address map preferably includes a set of entries, each
having a portion of a primary Web address and a
corresponding portion of a secondary Web address.

As will be readily appreciated from the foregoing
description, a system and method for retrieving data from
Web documents on a computer network provide a way of
retrieving and storing information pertaining to a Web
document, wherein the information is not easily obtainable
using a single retrieval protocol or network address. The
invention allows a Web crawler to retrieve file system
information, such as an access list, corresponding to a Web
document, wherein the Web document is originally retrieved
using a protocol that does not provide the file system
information. The invention also allows data from two
distinct Web documents to be combined, wherein a primary Web
document has a corresponding secondary Web document
containing information pertaining to the primary Web
document.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing aspects and many of the attendant advantages
of this invention will become more readily appreciated
as the same becomes better understood by reference to the
following detailed description, when taken in conjunction
with the accompanying drawings, wherein:

FIG. 1 is a block diagram of a general purpose computer
system for implementing the present invention;

FIG. 2 is a block diagram illustrating a network 
architecture, in accordance with the present invention; 

FIG. 3 is a block diagram illustrating an architecture of a
Web crawler program, in accordance with the present
invention;

FIG. 4 illustrates a data structure used to map addresses, 
in accordance with the present invention; and 

FIG. 5 is a flow diagram illustrating the process of
retrieving information pertaining to a Web document.

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

The present invention is a mechanism for obtaining
information pertaining to Web documents that reside on one
or more server computers. A server computer is referred to
as a Web site, and the process of locating and retrieving
digital data from Web sites is referred to as "Web crawling."
The mechanism of the invention uses a table to associate
Web address prefixes with a corresponding prefix that, if
substituted in the original address, may yield another
address with a different protocol, network site, or path. The
crawler uses the corresponding addresses or protocols to
obtain information that supplements the data received by
directly accessing the document using the document's
primary address.

In accordance with the present invention, a Web crawler
program executes on a computer, preferably a general purpose
personal computer. FIG. 1 and the following discussion
are intended to provide a brief, general description of a
suitable computing environment in which the invention may
be implemented. Although not required, the invention will
be described in the general context of computer-executable
instructions, such as program modules, being executed by a
personal computer. Generally, program modules include
routines, programs, objects, components, data structures,
etc. that perform particular tasks or implement particular
abstract data types. Moreover, those skilled in the art will
appreciate that the invention may be practiced with other
computer system configurations, including hand-held
devices, multiprocessor systems, microprocessor-based or
programmable consumer electronics, network PCs,
minicomputers, mainframe computers, and the like. The
invention may also be practiced in distributed computing
environments where tasks are performed by remote processing
devices that are linked through a communications network.
In a distributed computing environment, program
modules may be located in both local and remote memory
storage devices.

With reference to FIG. 1, an exemplary system for
implementing the invention includes a general purpose computing
device in the form of a conventional personal computer 20,
including a processing unit 21, a system memory 22, and a
system bus 23 that couples various system components
including the system memory to the processing unit 21. The
system bus 23 may be any of several types of bus structures
including a memory bus or memory controller, a peripheral
bus, and a local bus using any of a variety of bus
architectures. The system memory includes read only memory
(ROM) 24 and random access memory (RAM) 25. A basic
input/output system 26 (BIOS), containing the basic routines
that help to transfer information between elements within
the personal computer 20, such as during startup, is stored in
ROM 24. The personal computer 20 further includes a hard 
disk drive 27 for reading from and writing to a hard disk, not 
shown, a magnetic disk drive 28 for reading from or writing 
to a removable magnetic disk 29, and an optical disk drive 
30 for reading from or writing to a removable optical disk 31 
such as a CD ROM or other optical media. The hard disk 
drive 27, magnetic disk drive 28, and optical disk drive 30 
are connected to the system bus 23 by a hard disk drive 
interface 32, a magnetic disk drive interface 33, and an 
optical drive interface 34, respectively. The drives and their 
associated computer-readable media provide nonvolatile 
storage of computer readable instructions, data structures, 
program modules and other data for the personal computer 
20. Although the exemplary environment described herein 
employs a hard disk, a removable magnetic disk 29 and a 
removable optical disk 31, it should be appreciated by those 
skilled in the art that other types of computer-readable media 
which can store data that is accessible by a computer, such 
as magnetic cassettes, flash memory cards, digital versatile 
disks, Bernoulli cartridges, random access memories 
(RAMs), read only memories (ROM), and the like, may also 
be used in the exemplary operating environment. 

A number of program modules may be stored on the hard 
disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, 
including an operating system 35, one or more application 
programs 36, other program modules 37, and program data 
38. A user may enter commands and information into the 
personal computer 20 through input devices such as a 
keyboard 40 and pointing device 42. Other input devices 
(not shown) may include a microphone, joystick, game pad, 
satellite dish, scanner, or the like. These and other input 
devices are often connected to the processing unit 21 
through a serial port interface 46 that is coupled to the 
system bus, but may be connected by other interfaces, such 
as a parallel port, game port or a universal serial bus (USB). 
A monitor 47 or other type of display device is also 
connected to the system bus 23 via an interface, such as a 
video adapter 48. One or more speakers 57 are also con- 
nected to the system bus 23 via an interface, such as an audio 
adapter 56. In addition to the monitor and speakers, personal 
computers typically include other peripheral output devices 
(not shown), such as printers. 

The personal computer 20 operates in a networked envi- 
ronment using logical connections to one or more remote 
computers, such as remote computers 49 and 60. Each 
remote computer 49 or 60 may be another personal 
computer, a server, a router, a network PC, a peer device or 
other common network node, and typically includes many or 
all of the elements described above relative to the personal 
computer 20, although only a memory storage device 50 or 
61 has been illustrated in FIG. 1. The logical connections 
depicted in FIG. 1 include a local area network (LAN) 51 
and a wide area network (WAN) 52. Such networking
environments are commonplace in offices, enterprise-wide 
computer networks, intranets and the Internet. As depicted in 
FIG. 1, the remote computer 60 communicates with the 
personal computer 20 via the local area network 51. The 
remote computer 49 communicates with the personal com- 
puter 20 via the wide area network 52. 

When used in a LAN networking environment, the per- 
sonal computer 20 is connected to the local network 51 
through a network interface or adapter 53. When used in a 
WAN networking environment, the personal computer 20 
typically includes a modem 54 or other means for establish- 
ing communications over the wide area network 52, such as 
the Internet. The modem 54, which may be internal or 
external, is connected to the system bus 23 via the serial
port interface 46. In a networked environment, program
modules depicted relative to the personal computer 20, or
portions thereof, may be stored in the remote memory storage
device. It will be appreciated that the network connections
shown are exemplary and other means of establishing a
communications link between the computers may be used.

FIG. 2 illustrates an architecture of a networked system in
which the present invention operates. A server computer 204
includes a Web crawler program 206 executing thereon. The
Web crawler program 206 searches for Web documents
distributed on one or more computers connected to a computer
network 216, such as the remote server computer 218 depicted
in FIG. 2. The computer network 216 may be a local area
network 51 (FIG. 1), a wide area network 52, or a
combination of networks that allow the server computer 204
to communicate with remote computers, such as the remote
server computer 218, either directly or indirectly. The
server computer 204 and the remote server computer 218 are
preferably similar to the personal computer 20 depicted in
FIG. 1 and discussed above.

The Web crawler program 206 searches remote server computers
218 connected to the network 216 for Web documents 222 and
224. The Web crawler 206 retrieves Web documents and
associated data. The contents of the Web documents 222 and
224, along with the associated data, can be used in a
variety of ways. For example, the Web crawler 206 may pass
the information to an indexing engine 208. An indexing
engine 208 is a computer program that maintains an index 210
of Web documents. The index 210 is similar to the index in a
book, and contains reference information and pointers to
corresponding Web documents to which the reference
information applies. For example, the index may include
keywords, and for each keyword a list of addresses. Each
address can be used to locate a document that includes the
keyword. The index may also include information other than
keywords used within the Web documents. For example, the
index 210 may include subject headings or category names,
even when the literal subject heading or category name is
not included within the Web document. The type of
information stored in the index depends upon the complexity
of the search engine, which may analyze the contents of the
Web document and store the results of the analysis.

A client computer 214, such as the personal computer 20
(FIG. 1), is connected to the server computer 204 by a
computer network 212. The computer network 212 may be a
local area network, a wide area network, or a combination of
networks. The computer network 212 may be the same network
as the computer network 216 or a different network. The
client computer 214 includes a computer program, such as a
"browser" 215 that locates and displays documents to a user.
When a user at the client computer 214 desires to search for
one or more Web documents, the client computer transmits
data to a search engine 230 requesting a search. At that
time, the search engine 230 examines its associated index
210 to find documents that may be desired by the user. The
search engine 230 may then return a list of documents to the
browser 215 at the client computer 214. The user may then
examine the list of documents and retrieve one or more
desired Web documents from remote computers such as the
remote server computer 218.

The Web crawler program 206 maintains an address map 226.
The address map 226 is a simple database that contains a
list of Web document address prefixes. For each Web document
address prefix, referred to as a "primary" address prefix,
the address map 226 contains a "secondary" address prefix
corresponding to the primary address prefix. The Web crawler
206 uses the secondary address prefix to build a secondary
address. The secondary address is used to retrieve
information pertaining to a document, in order to augment or
replace the information retrieved by using the primary
address. This process is described in further detail below.

As will be readily understood by those skilled in the art of
computer network systems, and others, the system illustrated
in FIG. 2 is exemplary, and alternative configurations may
also be used in accordance with the invention. For example,
the server computer 204 itself may include Web documents 232
and 234 that are accessed by the Web crawler program 206.
Also the Web crawler program 206, the indexing engine 208,
and the search engine 230 may reside on different computers.
Additionally, the Web browser program and the Web crawler
program 206 may reside on a single computer. Further, the
indexing engine 208 and search engine 230 are not required
by the present invention. The Web crawler program 206 may
retrieve Web document information for usages other than
providing the information to a search engine. As discussed
above, the client computer 214, the server computer 204, and
the remote server computer 218 may communicate through any
type of communication network or communications medium.

FIG. 3 illustrates, in further detail, a Web crawler program
206 and related software executing on the server computer
204 (FIG. 2) that performs Web crawling and indexing of
information in accordance with the present invention. As
illustrated in FIG. 3, the Web crawler program 206 includes
a "gatherer" process 304 that performs crawling of the Web
and gathering of information pertaining to Web documents.
The gatherer process 304 is invoked by passing it one or
more starting URLs 306. The starting URLs 306 serve as
seeds, instructing the gatherer process 304 where to begin
its Web crawling process. A starting URL can be a universal
naming convention (UNC) directory, a UNC path to a file, or
an HTTP path to a file. A URL, or Web document address,
comprises specifications of a protocol, a domain, and a path
within the domain. The domain is also referred to as the
host. In one actual embodiment of the invention, the
protocol and the domain specifications form an address
prefix. As will be understood by those skilled in the art of
computer programming, and others, the invention can be used
with different address schemes.

The gatherer process 304 inserts the starting URLs 306 into
a transaction log 310, which maintains a list of URLs that
are currently being processed or have not yet been
processed. The transaction log 310 functions as a queue. It
is called a log because it is preferably implemented as a
persistent queue that is written and kept on a disk to
enable recovery after a system failure. Preferably, the
transaction queue maintains a small in-memory cache for
quick access to the next transactions.

The gatherer process 304 also maintains a history map 308,
which contains an ongoing list of all URLs that have been
searched during the current Web crawl. The gatherer process
304 includes one or more worker threads 312 that process
each URL. The worker thread 312 retrieves a URL from the
transaction log 310 and passes the URL to a filter daemon
314. The filter daemon 314 is a process that uses the URL to
retrieve the Web document at the address specified by the
URL. The filter daemon 314 uses the access method specified
by the URL to retrieve the Web document. For example, if the
access method is HTTP, the filter daemon 314 uses HTTP
commands to retrieve the document. If the access method
specified is FILE, the filter daemon uses file system
commands to retrieve the corresponding documents.
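
By way of illustration only, the gatherer process described
above can be sketched as a queue of pending URLs, a set of
URLs already seen, and a retrieval function chosen by access
method. The names crawl, retrievers, transaction_log, and
history are hypothetical, the sketch is single-threaded, and
the on-disk persistence of the transaction log is omitted.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(starting_urls, retrievers):
    """Process URLs in queue order, dispatching on each URL's access method.

    retrievers maps a protocol name ("http", "file", ...) to a function that
    takes a URL and returns (document_data, hyperlink_urls). Both the mapping
    and the return shape are assumptions made for this sketch.
    """
    transaction_log = deque(starting_urls)  # URLs waiting to be processed
    history = set(starting_urls)            # every URL seen in this crawl
    results = {}
    while transaction_log:
        url = transaction_log.popleft()
        retrieve = retrievers.get(urlsplit(url).scheme.lower())
        if retrieve is None:
            continue  # no retriever registered for this access method
        document_data, links = retrieve(url)
        results[url] = document_data
        for link in links:
            # Only URLs not yet in the history are queued for processing.
            if link not in history:
                history.add(link)
                transaction_log.append(link)
    return results
```

A retriever for "http" might wrap an HTTP client and a
retriever for "file" might read from the file system; either
can be swapped in without changing the loop.
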
The File Transfer Protocol (FTP) is another well known
access method that the filter daemon may use to retrieve a
document. Other access protocols, such as database retrieval
specifications, may also be used in conjunction with the
invention.

After retrieving a Web document, the filter daemon parses
the Web document and returns a list of text and properties.
An HTML document includes a sequence of "tags," each tag
containing some information. The information may be text
that is to be displayed in the Web browser program 215
(FIG. 2). The information may also be "metadata" that
describes the formatting of text. The information within
tags may also contain hyperlinks to other Web documents. A
hyperlink includes a specification of a Web address. If the
tag containing a hyperlink is an image, the Web browser
program 215 uses the hyperlink to retrieve the image and
render it on the Web page. Similarly, the hyperlink may
specify the address of audio data. If a hyperlink points to
audio data, the Web browser program retrieves the audio data
and plays it.

An "anchor" tag specifies a visual element and a hyperlink.
The visual element may be text or a hyperlink to an image.
When a user selects an anchor having an associated hyperlink
in a Web browser program 215, the Web browser program
automatically retrieves a Web document at the address
specified in the hyperlink.

Tags may also contain information intended for a search
engine. For example, a tag may include a subject or category
within which the Web document falls, to assist search
engines that perform searches by subject or category. The
information contained in tags is referred to as "properties"
of the Web document. A Web document is therefore considered
to be made up of a set of properties and text. The filter
daemon 314 returns the list of properties and text within a
Web document to the worker thread 312.

As discussed above, a Web document may contain one or more
hyperlinks. Therefore, the list of properties includes a
list of URLs that are included in hyperlinks within the Web
document. The worker thread 312 passes this list of URLs to
the history map 308. The history map 308 checks each URL to
determine if it is already listed within the history map.
URLs that are not already listed are added to the history
map and are also added to the transaction log 310, to be
subsequently processed by a worker thread.

The worker thread 312 then passes the list of properties and
text to the indexing engine 208. The indexing engine 208
creates an index 210, which is used by the search engine 230
in subsequent searches.

In accordance with the present invention, the Web crawler
also maintains an address map 226 that contains a set of
address prefix pairs. Each address prefix pair contains a
primary address prefix, which forms a portion of a primary
address, and a secondary address prefix. The secondary
address prefix is substituted for the primary address prefix
in a primary address to create a secondary address and
obtain information pertaining to the Web document located at
the primary address. In one actual implementation, the
address map 226 is implemented as a hash table, and primary
address prefixes are hashed to locate entries within the
table. During the processing of a primary address
corresponding to a Web document, the Web crawler checks the
address map 226 to determine if there is an associated
secondary address prefix. If there is a secondary address
prefix, the Web crawler retrieves data pertaining to the Web
document by using a secondary address. The Web crawler may
retrieve the Web document using the primary address, or it
may limit the data retrieval to data obtained using the
secondary address. The data obtained using the secondary
address is passed to the indexing engine 208. The process of
using an address map 226 is illustrated in FIG. 5 and
discussed in further detail below.

FIG. 4 illustrates an exemplary address map 226. As
illustrated in FIG. 4, the address map 226 includes a hash
table 322. Each hash table entry [text illegible in source]
address mappings 328, 330, 332, 334. The use of hash tables
is well known in the art and is not discussed in detail
herein, except as necessary to explain the invention.

An address mapping entry 328 contains a primary address
prefix 336 and a secondary address prefix 338. The primary
address prefix 336 comprises a protocol specification 340
and a top level domain specification 342. The secondary
address prefix 338 also contains a protocol specification
344 and a top level domain specification 346. Using an
address mapping entry 328, 330, 332, or 334, the worker
thread 312 retrieves a primary address prefix from a URL,
and finds a corresponding secondary address prefix 338 in
the address map 320. The worker thread then creates a
second, complete address by replacing the primary address
prefix in the original URL with the secondary address
prefix, as discussed below.

An address prefix can also include a directory
specification. If a primary address prefix 336 includes a
directory specification, the corresponding secondary address
prefix includes a directory specification, which is used to
create the secondary address. An address prefix may further
include a file specification. In such a situation, the
corresponding secondary address prefix specifies the entire
secondary address. As depicted in FIG. 4, the address
mapping 334 includes a primary address prefix 350 that
comprises a protocol specification 352, a top level domain
specification 354, and a directory specification 356. The
corresponding secondary address specification 358 comprises
a protocol specification 360, a top level domain
specification 362, and a directory specification 364.

The address map 226 can be created and maintained in any of
several ways. For example, a user can manually enter a list
of primary address prefix and secondary address prefix
pairings each time a new entry is desired. Alternatively, a
user can write a computer program that creates address
mappings between a FILE protocol and an HTTP protocol when
the server 204 (FIG. 2) is a local server.

FIG. 5 illustrates a process 502 of retrieving and storing
data using the address map of the present invention. At a
step 504, the worker thread 312 (FIG. 3) retrieves a URL
from the transaction log 310. At a step 506, the Web crawler
retrieves a Web document using the protocol and address
specified in the URL. In one actual embodiment, the worker
thread 312 passes the URL to the filter daemon 314, which
retrieves the Web document. This is performed using the
specified file retrieval protocol, such as HTTP, FTP, or
FILE. The step 506 also includes retrieving the data from
the Web document. The retrieval of data includes parsing the
document, identifying each tag, and filtering out
unnecessary tags and data. At a step 508, the worker thread
312 sets the first address prefix to be the entire address
specified by the URL. As discussed above, the first address
prefix therefore includes a protocol, a top level domain, an
optional directory specification, and a file name. As URLs
are typically formatted, the characters prior to the first
colon specify the protocol. For example, in the URL

http://www.microsoft.com/docs/page1.html

"http" specifies the protocol. The top level domain is
specified by the character string between "://" and the next
single
slash. In the above example, "www.microsoft.com" is the
top level domain. The remainder of the URL specifies a
directory path and a file name. In the above example,
"docs/page1.html" specifies the path to a file named
"page1.html."

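By way of illustration only, the decomposition of a URL
described above can be sketched with Python's standard
urllib.parse module. The function name split_address is
hypothetical. What the description calls the top level domain
corresponds to the host portion that urlsplit returns as
netloc.

```python
from urllib.parse import urlsplit


def split_address(url):
    """Split a URL into protocol, host, directory path, and file name."""
    parts = urlsplit(url)
    # Drop the leading slash, then split the path at its last slash.
    directory, _, file_name = parts.path.lstrip("/").rpartition("/")
    return parts.scheme, parts.netloc, directory, file_name


protocol, host, directory, file_name = split_address(
    "http://www.microsoft.com/docs/page1.html"
)
```

For the example URL, protocol is "http", host is
"www.microsoft.com", directory is "docs", and file_name is
"page1.html".
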
At a step 510, a determination is made of whether the 
address prefix is in the address map 226 as a primary address 
prefix 336 (FIG. 4). If it is, at a step 512, the secondary 
address prefix 338 corresponding to the primary address 
prefix 336 is retrieved. At a step 514, a secondary address is
built by combining the primary address with the secondary
address prefix. When the primary address prefix is the entire
primary address, the secondary address prefix becomes the
secondary address. Although the invention as described
utilizes address prefixes that may comprise a portion of an
address, the mechanism of the invention may be applied to 
address mappings where each entry in the address map has 
a primary address and a corresponding secondary address. 
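
The simplest variant mentioned above, in which each map entry pairs a whole primary address with a whole secondary address, might look like the following sketch. It assumes a Python `dict` as the hash table; the map contents, the host names, and the function name `lookup_secondary` are hypothetical.

```python
# Hypothetical flat address map: whole primary address -> whole secondary
# address. A dict is a hash table, so the key is hashed to locate the entry.
FLAT_ADDRESS_MAP = {
    "http://www.example.com/docs/page1.html": "file://fileserver/docs/page1.html",
}


def lookup_secondary(primary_address, address_map):
    """Return the secondary address for an exact primary address, or None."""
    return address_map.get(primary_address)
```

With a flat map no address building is needed: a hit returns the secondary address directly, and a miss means the document has no secondary source.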

A primary address can have a plurality of corresponding 
addresses that are used to obtain data. For example, an entry
in the address map may include a primary address prefix, a
secondary address prefix, and a tertiary address prefix. The
number of addresses corresponding to a primary address
may vary with each primary address. It should be readily
apparent to one skilled in the art of computer programming,
and others, that the process 502 of retrieving and storing data 
can be modified to accommodate multiple and varying 
numbers of addresses corresponding to a primary address. 
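
One way to accommodate a varying number of corresponding addresses is to store a list of prefixes per entry. The sketch below assumes that representation; the map contents, host names, and the function name `related_addresses` are hypothetical and are not taken from the specification.

```python
# Hypothetical address map: each primary address prefix maps to a list of
# further prefixes (secondary, tertiary, ...). List lengths may differ.
ADDRESS_MAP = {
    "http://www.example.com": ["file://server1", "http://meta.example.com"],
    "http://intranet.example.com": ["file://server2"],
}


def related_addresses(primary_address, primary_prefix, address_map):
    """Build every address that corresponds to the given primary address."""
    # The part of the primary address "below" the matched prefix.
    suffix = primary_address[len(primary_prefix):]
    return [prefix + suffix for prefix in address_map.get(primary_prefix, [])]
```

A caller would then retrieve data from each returned address in turn and combine all of it with the data from the primary document.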

After building a new URL, at a step 516, the worker 
thread 312 passes the secondary address to the filter daemon
314, which retrieves additional data using the secondary
address. The secondary data may be a Web document or
system level data pertaining to the document. For example,
if the secondary address uses the protocol FILE, at step 516,
the mechanism of the invention may retrieve such data as the
time stamp of the last file update or an access control list. An 
access control list specifies the set of users that has security 
access to the file. 
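
As a rough illustration of retrieving system level data through a FILE address, the sketch below reads the last-update time stamp of a local file. It makes several assumptions: the function name `file_metadata` is hypothetical, only local `file://` addresses are handled (the host part, such as a file server name, is ignored), and POSIX-style permission bits stand in for an access control list because the Python standard library has no portable ACL call.

```python
import os
import stat
from urllib.parse import urlparse
from urllib.request import url2pathname


def file_metadata(file_url):
    """Return system level data for a local file:// address (sketch only)."""
    parsed = urlparse(file_url)
    if parsed.scheme != "file":
        raise ValueError("not a FILE address: " + file_url)
    # Convert the URL path into a native file system path.
    path = url2pathname(parsed.path)
    info = os.stat(path)
    return {
        # Time stamp of the last file update, in seconds since the epoch.
        "last_modified": info.st_mtime,
        # Stand-in for an access control list: the permission string.
        "permissions": stat.filemode(info.st_mode),
    }
```

A real deployment would replace the permission string with whatever access control information the file system in question exposes.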

At step 518, the data retrieved at step 516 is combined 
with the data retrieved at step 506. At step 520, the combined
data is used. For example, as illustrated in FIG. 3, the worker 
thread may pass the combined data to an indexing engine 
208 for insertion into an index 210, to be subsequently used 
by a search engine 230. 
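
Steps 518 and 520 can be pictured with a very small in-memory model. The sketch assumes that both data sets are dictionaries and that the index is a dictionary keyed by primary address; the names `combine` and `index_document` and the field names are hypothetical, and a real indexing engine would be far more elaborate.

```python
def combine(document_data, secondary_data):
    """Merge secondary data into the document data (step 518, sketch)."""
    # Start from the secondary data so that, if both sources define the
    # same field, the value obtained from the Web document itself wins.
    combined = dict(secondary_data or {})
    combined.update(document_data)
    return combined


def index_document(index, primary_address, combined):
    """Insert the combined record into the index (step 520, sketch)."""
    index[primary_address] = combined
```

When no secondary address exists, `combine` is called with `None` and the record contains only the data retrieved from the Web document.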

If at the step 510 it is determined that the URL prefix does
not exist in the address map as a primary address prefix, at 
a step 522, the worker thread 312 obtains the next address 
prefix by reducing the address prefix from the right side until 
a slash is found. At a step 524, a determination is made of 
whether a top level domain specification still remains in the
new address prefix. If it does not, flow control proceeds to
step 520, where the data retrieved at step 506 is entered into
the index 210, without secondary data. If, at the step 524, a
top level domain still remains in the new address prefix, flow
control returns to the step 510, to search for the new address
prefix in the address map. 
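
The prefix reduction of steps 522 and 524 can be sketched as a generator that yields each candidate prefix from longest to shortest. The function name `candidate_prefixes` is hypothetical, and the sketch assumes the URL contains "://".

```python
def candidate_prefixes(url):
    """Yield address prefixes of a URL, longest first (sketch)."""
    scheme_end = url.find("://")
    if scheme_end == -1:
        raise ValueError("expected '://' in the URL")
    domain_start = scheme_end + 3
    prefix = url
    while True:
        yield prefix
        # Step 522: reduce the prefix from the right until a slash is found.
        cut = prefix.rfind("/")
        # Step 524: stop once the cut would remove the top level domain.
        if cut < domain_start:
            return
        prefix = prefix[:cut]


prefixes = list(candidate_prefixes("http://www.microsoft.com/docs/page1.html"))
```

For the example URL this produces the whole address, then "http://www.microsoft.com/docs", then "http://www.microsoft.com", after which no top level domain would remain and the iteration ends.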

After the address prefix is reduced to exclude the file
specification, at the step 514, the retrieved secondary
address prefix is combined with the directory and file
specification "below" the address prefix to build a new
secondary URL. For example, if the URL is
"http://www.microsoft.com/docs/page1.html," the address prefix
being searched at the step 510 is
"http://www.microsoft.com," and the secondary address prefix
corresponding to the primary address prefix is "file://msserver,"
the new secondary URL is
"file://msserver/docs/page1.html." In the exemplary process 502
depicted in FIG. 5, the secondary address prefix corresponding
to the longest address prefix of a URL is used if more than one
primary address prefix of the URL has an entry in the
address map.
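
The worked example above, including the longest-prefix rule, can be reproduced with the following sketch. It assumes the address map is a Python `dict` from primary address prefix to secondary address prefix; the function name `secondary_url` and the second map entry ("file://docserver/public") are hypothetical.

```python
def secondary_url(primary_url, address_map):
    """Build the secondary URL for a primary URL, or return None (sketch)."""
    domain_start = primary_url.index("://") + 3
    prefix = primary_url
    # Try the longest prefix first, so the longest matching entry wins.
    while len(prefix) > domain_start:
        secondary_prefix = address_map.get(prefix)
        if secondary_prefix is not None:
            # Step 514: append the part of the URL "below" the prefix.
            return secondary_prefix + primary_url[len(prefix):]
        # Reduce the prefix from the right until a slash is found.
        prefix = prefix[:prefix.rfind("/")]
    return None


address_map = {"http://www.microsoft.com": "file://msserver"}
result = secondary_url("http://www.microsoft.com/docs/page1.html", address_map)
```

With the single entry shown, `result` is "file://msserver/docs/page1.html", matching the example in the text. If the map also held an entry for "http://www.microsoft.com/docs", that longer prefix would be found first and its secondary prefix would be used instead.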

FIG. 5 illustrates a process 502 of obtaining information 
pertaining to a single address. As discussed above, a Web 
crawler repeats this process for many URLs, as it uses the 
links within each Web document to traverse a network of 
Web documents. 
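
The repetition of the process over linked documents can be pictured as a breadth-first traversal. This is a deliberately simplified, single-threaded sketch: the function name `crawl` and its `get_links` and `process` callbacks are hypothetical, and the worker threads and other components of the described system are not modeled.

```python
from collections import deque


def crawl(start_url, get_links, process):
    """Apply `process` to every URL reachable from start_url (sketch)."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        # Stands in for the per-address process 502 of FIG. 5.
        process(url)
        order.append(url)
        # Follow the links found within the document just processed.
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Passing the link extraction and the per-address processing as callbacks keeps the traversal itself independent of any network access, which makes it easy to exercise against an in-memory link graph.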

While the preferred embodiment of the invention has been 
illustrated and described, it will be appreciated that various 
changes can be made therein without departing from the 
spirit and scope of the invention. 

The embodiments of the invention in which an exclusive 
property or privilege is claimed are defined as follows: 

1. A computer-based method of retrieving Web document 
information from a computer network, comprising: 

retrieving a Web document from a computer network 
using a first protocol included in a primary document 
address specification; 

obtaining data from the Web document; 

determining whether the primary document address speci- 
fication has a corresponding secondary document 
address specification; and 

if the primary document address specification has a cor- 
responding secondary document address specification, 
retrieving supplementary data from the computer net- 
work pertaining to the Web document using a second 
protocol included in the secondary document address 
specification. 

2. The method of claim 1, wherein the primary document 
address specification includes the first protocol and a first 
network address, and the secondary document address speci- 
fication includes the second protocol and a second network 
address, and wherein the first protocol is different from the 
second protocol. 

3. The method of claim 2, wherein the first protocol is 
HTTP and the second protocol is FILE. 

4. The method of claim 3, further comprising retrieving an 
access control list corresponding to the Web document by 
using the secondary document address specification. 

5. The method of claim 4, further comprising storing the 
supplementary data pertaining to the Web document 
retrieved using the second protocol from the secondary 
document address specification with the data retrieved from 
the Web document using the first protocol from the primary 
document address specification in a document index. 

6. The method of claim 1, wherein the primary document 
address specification includes the first protocol and a first 
network address, and the secondary document address speci- 
fication includes the second protocol and a second network 
address, and the first network address is different from the 
second network address. 

7. The method of claim 6, wherein retrieving supplemen- 
tary data pertaining to the Web document by using the 
secondary document address specification comprises 
retrieving a second Web document that includes supplemen- 
tary data pertaining to the Web document. 

8. The method of claim 7, wherein retrieving the second 
Web document includes using the hypertext transfer proto- 
col (HTTP) to retrieve the second Web document. 

9. The method of claim 7, wherein retrieving the second 
Web document includes using a database specification to 
retrieve the second Web document. 

10. The method of claim 1, wherein determining whether
the primary document address specification has a corresponding
secondary document address specification
includes determining whether an entry corresponding to the
primary document address specification exists in an address
map.

11. The method of claim 10, wherein the entry corresponding
to the primary document address specification
includes a transfer protocol specification and a top level 
domain specification. 

12. The method of claim 1, further comprising:

determining whether the primary document address specification
has a corresponding tertiary document address
specification;

if the primary document address specification has a cor- 
responding tertiary document address specification, 
retrieving further data pertaining to the Web document 
by using the tertiary document specification; and 

if the primary document address specification has a cor- 
responding tertiary document address specification, 
storing the further data pertaining to the Web document 
obtained using the tertiary document address specification
with the data obtained from the Web document.

13. The method of claim 1, wherein the secondary document
address specification is automatically built by replacing
a secondary address prefix for a primary address prefix
in the primary document address specification.

14. The method of claim 13, further comprising: 

(a) obtaining a URL from a transaction log; 

(b) parsing the URL into a URL prefix and URL suffix; 

(c) providing an address map containing a plurality of 
primary address prefixes and corresponding secondary
address prefixes; 

(d) determining if the URL prefix is included in the 
address map as a primary address prefix; 

(i) if the URL prefix is included in the address map as
a primary address prefix, combining a secondary
address prefix that corresponds to the primary
address prefix with the URL suffix to build the
secondary document address specification; and

(ii) if the URL prefix is not included in the address map
as a primary address prefix, changing the parsing of
the URL to incrementally reduce the URL prefix and
increase the URL suffix and then repeating this
paragraph (d).

15. A computer-based method of retrieving information 
from a computer network during a network crawl, comprising:

retrieving an electronic document from the computer 
network, the electronic document including at least one 
hyperlink specification including a primary document 
address specification; 

retrieving at least one primary document address speci- 
fication from the electronic document using a first 
protocol included in the primary address specification, 
each primary document address corresponding to a 
linked electronic document; 

determining whether the primary document address speci- 
fication has a corresponding secondary document 
address specification; 

if the primary document address specification has a corresponding
secondary document address specification,
retrieving supplementary data pertaining to the linked 
electronic document from the computer network using 
a second protocol included in the secondary document 
address specification; and

if the primary document address specification has a corresponding
secondary document address specification,
storing the supplementary data pertaining to the linked
electronic document obtained using the secondary 
document address specification and associating the 
stored supplementary data pertaining to the linked 
electronic document with the primary document 
address specification. 

16. The method of claim 15, wherein the primary docu- 
ment address specification includes the first protocol and a 
first network address, and the secondary document address 
specification includes the second protocol and a second 
network address, and wherein the first protocol is different 
from the second protocol. 

17. The method of claim 15, wherein the primary docu- 
ment address specification includes the first protocol and a 
first network address, and the secondary document address 
specification includes the second protocol and a second 
network address, and the first network address is different 
from the second network address. 

18. The method of claim 15, wherein determining whether 
the primary document address specification has a corre- 
sponding secondary document address specification 
includes determining whether an entry corresponding to the 
primary document address specification exists in an address 
map. 

19. The method of claim 15, further comprising:

retrieving data from the linked electronic document using
the primary document address specification;

storing the data retrieved from the linked electronic 
document using the primary document address speci- 
fication; and 

associating the data retrieved from the linked electronic 
document using the primary document address speci- 
fication with the supplementary data pertaining to the 
linked electronic document retrieved using the second- 
ary document address specification. 

20. The method of claim 15, further comprising:

automatically retrieving a plurality of primary document
address specifications from a plurality of hyperlinks
included in the electronic document; 

automatically retrieving a plurality of secondary docu- 
ment address specifications corresponding to said plu- 
rality of primary document address specifications; and 

automatically retrieving supplementary data pertaining to 
the linked electronic document using said plurality of 
secondary document address specifications. 

21. The method of claim 15, wherein the secondary 
address specification is automatically built by replacing a 
secondary address prefix for a primary address prefix in the 
primary document address specification. 

22. The method of claim 21, further comprising: 

(a) obtaining a URL from a transaction log; 

(b) parsing the URL into a URL prefix and URL suffix; 

(c) providing an address map containing a plurality of 
primary address prefixes and corresponding secondary 
address prefixes; 

(d) determining if the URL prefix is included in the 
address map as a primary address prefix; 

(i) if the URL prefix is included in the address map as 
a primary address prefix, combining a secondary 
address prefix that corresponds to the primary 
address prefix with the URL suffix to build the 
secondary address specification; and 

(ii) if the URL prefix is not included in the address map 
as a primary address prefix, changing the parsing of 
the URL to incrementally reduce the URL prefix and
increase the URL suffix and then repeating this
paragraph (d).

23. A system for performing a Web crawl, the system
comprising:

a server computer having a Web crawler program executing
thereon;

an address map accessible to the Web crawler program
and containing a plurality of primary Web addresses
and a plurality of secondary Web addresses, each
primary Web address having a corresponding secondary
Web address;

the primary Web address including a first protocol for the
retrieval of a Web document at the primary Web
address;

the secondary Web address including a second protocol
for the retrieval of a Web document at the secondary
Web address;

the second protocol for the retrieval of a Web document
at the secondary Web address being different than the
first protocol for the retrieval of a Web document at the
primary Web address;

a computer network including at least one Web server
having a plurality of Web documents stored thereon,
each Web document having a corresponding primary
Web address;

a database containing information pertaining to the plurality
of Web documents;

program code for:

retrieving a primary Web address corresponding to one
of the Web documents;

determining whether the primary Web address has a
corresponding secondary Web address;

selectively retrieving supplementary information pertaining
to said one of the Web documents using the
corresponding secondary Web address; and

if the supplementary information pertaining to said one
of the Web documents is retrieved using the secondary
Web address, storing said supplementary information
in the database.

24. The system of claim 23, further comprising a search
engine that performs a Web search using the database.

25. The system of claim 24, wherein the first protocol is
HTTP, the second protocol is FILE, and the retrieving of
supplementary information pertaining to said one of the Web
documents using the secondary Web address includes using
file system commands to retrieve the supplementary information
pertaining to said one of the Web documents.

26. The system of claim 25, wherein the supplementary
information pertaining to said one of the Web documents
includes an access control list.

27. The system of claim 23, wherein each primary Web
address includes a data transfer protocol specification and a
top level domain specification, and each secondary Web
address includes a data transfer protocol specification and a
top level domain specification.

28. The system of claim 23, wherein the address map and
said plurality of Web documents reside on different computers.

* * * * *